In this project we analyse white wine data in R. We look for patterns in objective wine properties such as pH and alcohol, and explore how these properties relate to the subjective assessment of wine quality (a ranking by expert tasters). At the end we build a linear regression model and try to predict wine quality from the wine’s chemical properties.
First we load the packages needed for the exploration and read the data into R. To reproduce this on your local machine, install those packages first and copy the CSV file with the wine data (wineQualityWhites.csv) to your working directory.
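A minimal sketch of the loading step, using base R only. The helper name `load_wines` is hypothetical, and the assumption that the CSV carries a row-index column named `X` is mine; the two quality columns mirror the `str()` output shown below.

```r
# Hypothetical helper: read the wine CSV and add the two quality columns
# seen in the str() output (factor `quality`, integer `quality_int`).
load_wines <- function(path = "wineQualityWhites.csv") {
  wines <- read.csv(path)
  # Assumption: the file has a row-index column named X; drop it if present.
  wines$X <- NULL
  # Keep the numeric score, then turn `quality` itself into a factor.
  wines$quality_int <- as.integer(wines$quality)
  wines$quality <- factor(wines$quality_int)
  wines
}

# Only runs when the file is in the working directory.
if (file.exists("wineQualityWhites.csv")) {
  wines <- load_wines()
  str(wines)
}
```

Keeping both a factor and an integer version of quality is convenient: the factor suits bar charts and box plots, the integer suits correlations and regression.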
Let’s look at our data.
## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ quality_int : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality_int
## 3: 20 Min. :3.000
## 4: 163 1st Qu.:5.000
## 5:1457 Median :6.000
## 6:2198 Mean :5.878
## 7: 880 3rd Qu.:6.000
## 8: 175 Max. :9.000
## 9: 5
We have 11 input variables that describe chemical properties of white wine and one output variable, wine quality. The quality column is stored in the data frame as a factor, and the last column, quality_int, holds the same score as an integer. We’ve got 4,898 observations, and they represent different brands and types of white wine.
Now I want to plot histograms and bar charts of all variables to get an idea of their distributions.
Next is volatile acidity. The distribution of this variable is bell-shaped, with a large number of outliers. There is no need to transform this variable.
Next is citric acid. Again the distribution is bell-shaped with a large number of outliers, and there is no need to transform this variable. I also want to emphasize that the first three variables all describe acidity and could be highly correlated. We should keep this in mind in our further analysis.
Now it’s time to analyse residual sugar. The distribution on the fourth plot does not look approximately normal, but we can improve it by taking the log of the variable. I’ll add the log of residual sugar to our dataset because it could fit our model better than the original variable.
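The transformation can be sketched as follows. The values are the first ten residual sugar readings from the `str()` output above; the column name `residual.sugar.log` and the choice of the natural log are assumptions, since the text does not state the base.

```r
# Stand-in data: the first ten residual.sugar values shown by str() above.
wines_demo <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Assumption: natural log; log10() would only rescale the axis.
wines_demo$residual.sugar.log <- log(wines_demo$residual.sugar)

# Compare the spread before and after the transformation.
summary(wines_demo$residual.sugar)
summary(wines_demo$residual.sugar.log)
```

All values here are positive, so the log is defined everywhere; with zeros in the data one would need a shifted transform instead.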
Next variable is chlorides. The distribution is bell-shaped, with a large number of outliers on the right. I’ve cut the top 2% of values to make the plots more detailed.
Next variable is free sulfur dioxide. The distribution is bell-shaped, with some outliers. I’ve cut the top 1% of values to make the plots more detailed.
Next variable is total sulfur dioxide, which is potentially highly correlated with free sulfur dioxide. The distribution is bell-shaped and skewed to the left, with a few outliers. I’ve cut the top 0.5% of values to make the plots more detailed.
Next variable is the density of wine. The distribution is bell-shaped. I’ve cut the top 0.1% of values to make the plots more detailed.
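Cutting the largest values can be done with a quantile threshold. The sketch below is one possible way, not necessarily the one used for the plots above; `trim_top` is a hypothetical helper and the chlorides values are synthetic.

```r
# Hypothetical helper: drop the top `prop` share of values of x.
trim_top <- function(x, prop) {
  cutoff <- quantile(x, probs = 1 - prop, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Synthetic, right-skewed stand-in for the chlorides column.
set.seed(1)
chlorides_demo <- rlnorm(1000, meanlog = log(0.043), sdlog = 0.4)

# Cut the top 2% of values, as described for chlorides.
trimmed <- trim_top(chlorides_demo, 0.02)

# Number of observations removed.
length(chlorides_demo) - length(trimmed)
```

An alternative is to leave the data untouched and only limit the x axis of the plot to the same quantile, which keeps the outliers available for later modelling.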
The next three variables (pH, sulphates and alcohol) look much alike. They are all roughly bell-shaped and have some outliers.
Our last variable is quality. It differs from all the others: it is not a continuous variable but a categorical one. On the bar chart below we can see that the most common quality score is 6 (2,198 wines), followed by 5 (1,457 wines).
We’ve transformed residual sugar and cut the outliers of chlorides, free.sulfur.dioxide, total.sulfur.dioxide and density, so the patterns in our data are now easier to see. The main conclusion of this section is that the distributions of most variables are bell-shaped (approximately normal).
There are 4,898 observations in our dataset, each representing a white wine. For each observation we have 12 features (Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates, Alcohol, and Quality). All variables are numeric.
All of our variables except alcohol and residual sugar are approximately normally distributed (bell-shaped). Alcohol is positively skewed. The log of residual sugar is bimodal and positively skewed.
At this point we can’t yet say which variables are more or less important. We’ll do this a bit later, after creating the correlation and plot matrices.
We added a factor version of quality and log-transformed residual.sugar. We’ll add some buckets later in the analysis.
Most of our variables are approximately normally distributed with many outliers. After the log transformation, residual.sugar looks more bell-shaped, with two modes around 1.2 and 9.
I want to start the bivariate analysis by creating a correlation matrix and a plot matrix. These will help us choose the important variables for further exploration. The plot matrix is built from a small sample of our data because building it is time-consuming, so you may get slightly different results when running this on your local machine.
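A minimal sketch of the correlation step in base R. The data frame `wines_demo` is synthetic (density is constructed from sugar and alcohol so that some correlation exists), so the numbers it prints are not the ones reported in this analysis.

```r
# Synthetic stand-in data; coefficients below are illustrative assumptions.
set.seed(42)
n <- 500
alcohol <- rnorm(n, 10.5, 1.2)
residual.sugar <- rexp(n, 1 / 6)
density <- 0.994 + 0.0004 * residual.sugar -
  0.001 * (alcohol - 10.5) + rnorm(n, 0, 0.0005)
wines_demo <- data.frame(alcohol, residual.sugar, density)

# Pearson correlation matrix of all numeric columns.
cor_mat <- cor(wines_demo)

# List each pair once, strongest absolute correlation first.
pairs_idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[pairs_idx[, "row"]],
  var2 = colnames(cor_mat)[pairs_idx[, "col"]],
  r = cor_mat[pairs_idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)

# A plot matrix can be drawn from a random subset of rows to save time.
sample_rows <- sample(nrow(wines_demo), 100)
# pairs(wines_demo[sample_rows, ])
```

Because the plot matrix uses a random sample, fixing the seed with `set.seed()` is what makes the picture reproducible between runs.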
From the correlation and plot matrices, the highest correlations are in the pairs residual.sugar - density, density - alcohol, free.sulfur.dioxide - total.sulfur.dioxide and fixed.acidity - pH. Let’s make scatterplots of these pairs.
We can observe a clear positive correlation in our first pair, density and residual sugar: denser wines tend to contain more residual sugar. A correlation coefficient of 0.857 confirms this.
On the second plot we can see a negative correlation between alcohol and density: wines with more alcohol are less dense. The correlation coefficient is -0.678.
Total sulfur dioxide and free sulfur dioxide are positively correlated, as we assumed in the univariate plots section. The correlation coefficient is 0.651.
Despite the fairly high absolute value of the correlation coefficient (-0.427), I can’t see a clear relationship between pH and fixed acidity.
Now I want to scrutinize the relation of quality to density and alcohol. These two variables have the highest correlation with quality. For each of them I’ll build a scatterplot, box plots for every quality level and a summary table for every quality level.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
Here we can see that wines with a quality level of 5 tend to be the densest, and that above level 5 wines get less dense with every next level.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
##
## $`9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
Here is the opposite picture. Wines with a quality level of 5 tend to contain the least alcohol, and every next quality level contains more alcohol.
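Per-level summary tables like the two above can be produced with `tapply()`. This is a sketch on a tiny hand-made data frame (`wines_demo` is a stand-in, not the real data), so it only illustrates the mechanics.

```r
# Tiny stand-in data frame: three made-up wines per quality level.
wines_demo <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7)),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# One six-number summary of alcohol per quality level.
by_quality <- tapply(wines_demo$alcohol, wines_demo$quality, summary)
print(by_quality)

# The medians alone make the trend across levels easy to read.
medians <- tapply(wines_demo$alcohol, wines_demo$quality, median)
print(medians)
```

Swapping `alcohol` for `density` gives the density table; the grouping factor stays the same.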
In this section we found some interesting patterns in our data. Here are some of the observations:
We should take these facts into consideration when building the regression model, because strong correlation among input variables leads to multicollinearity and an unstable model.
On the other hand, the correlation of quality with the other variables does not look very strong. This suggests that our model won’t be very good. We’ll verify that guess in the next section.
Most of the variables, except those stated above, have low or moderate correlation with each other and low correlation with wine quality.
Density and residual sugar have the strongest linear relationship in the data. The correlation coefficient of these two variables is 0.84, which is a very strong correlation.
Now I’ll create a few bins. We’ll use them in our further analysis.
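Binning a continuous variable can be sketched with `cut()`. The break points here are the quartiles of alcohol from the summary table earlier (8, 9.5, 10.4, 11.4, 14.2); whether the bins used in the plots follow the same breaks is an assumption.

```r
# A few alcohol values as stand-in data.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 11, 12.5, 13.4)

# Assumed breaks: min, quartiles and max of alcohol from summary().
alcohol_bin <- cut(
  alcohol,
  breaks = c(8, 9.5, 10.4, 11.4, 14.2),
  include.lowest = TRUE  # keep the minimum value inside the first bin
)

# Number of wines per bin.
table(alcohol_bin)
```

Quartile breaks give bins of roughly equal size on the full data, which keeps each box plot or facet reasonably populated.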
The first two plots of this section show that density depends on alcohol and residual sugar. The next two plots show that density depends on fixed acidity too.
I’ve made the last plot to show how quality depends on the density and alcohol bins.
After scrutinizing the data it’s time for model building. First I want to show how powerful prediction models can be, even ones as simple as linear regression. Let’s build a linear regression model to predict density. In our first try we’ll use all of the available variables.
##
## Call:
## lm(formula = density ~ ., data = wines_for_prediction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0050586 -0.0003002 -0.0000425 0.0002463 0.0215427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.858e-01 2.589e-04 3808.450 < 2e-16 ***
## fixed.acidity 7.666e-04 1.106e-05 69.339 < 2e-16 ***
## volatile.acidity 4.698e-04 8.685e-05 5.409 6.63e-08 ***
## citric.acid 3.372e-04 7.121e-05 4.735 2.25e-06 ***
## residual.sugar 3.737e-04 1.910e-06 195.701 < 2e-16 ***
## chlorides 4.613e-03 4.020e-04 11.477 < 2e-16 ***
## free.sulfur.dioxide -6.485e-06 6.235e-07 -10.401 < 2e-16 ***
## total.sulfur.dioxide 3.813e-06 2.765e-07 13.793 < 2e-16 ***
## pH 3.482e-03 6.116e-05 56.926 < 2e-16 ***
## sulphates 1.447e-03 7.221e-05 20.032 < 2e-16 ***
## alcohol -1.096e-03 9.188e-06 -119.259 < 2e-16 ***
## quality -8.348e-05 1.060e-05 -7.879 4.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00056 on 4886 degrees of freedom
## Multiple R-squared: 0.965, Adjusted R-squared: 0.9649
## F-statistic: 1.226e+04 on 11 and 4886 DF, p-value: < 2.2e-16
This model looks good at first sight. We have an extremely high R-squared and all variables are statistically significant. But the model suffers from multicollinearity, and in my opinion 11 exogenous variables are too many for a linear regression model. I’ll skip the steps where I removed 7 variables for different reasons. Here is my final model.
##
## Call:
## lm(formula = density ~ fixed.acidity + residual.sugar + total.sulfur.dioxide +
## alcohol, data = wines_for_prediction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0035347 -0.0004565 -0.0001072 0.0003363 0.0251100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.994e-01 1.671e-04 5982.07 <2e-16 ***
## fixed.acidity 5.327e-04 1.319e-05 40.39 <2e-16 ***
## residual.sugar 3.463e-04 2.518e-06 137.52 <2e-16 ***
## total.sulfur.dioxide 5.042e-06 3.002e-07 16.80 <2e-16 ***
## alcohol -1.131e-03 1.066e-05 -106.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0007722 on 4893 degrees of freedom
## Multiple R-squared: 0.9334, Adjusted R-squared: 0.9333
## F-statistic: 1.714e+04 on 4 and 4893 DF, p-value: < 2.2e-16
This model has a slightly lower R-squared, but in my opinion it is much better.
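One common way to quantify multicollinearity is the variance inflation factor (VIF). The text does not say how the variables were screened, so this is a swapped-in illustration rather than the procedure actually used; `vif_manual` is a hypothetical helper and the data are synthetic.

```r
# Hypothetical helper: VIF of each predictor, computed as 1 / (1 - R^2)
# from regressing that predictor on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Synthetic data: x2 is nearly a copy of x1, x3 is independent.
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
demo <- data.frame(x1, x2, x3)

vifs <- vif_manual(demo, c("x1", "x2", "x3"))
print(round(vifs, 1))
```

A VIF near 1 means a predictor is almost unrelated to the others; a frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign that a predictor is largely redundant.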
Now I want to come back to our main goal: predicting wine quality. Again I’ll start by building the model with all variables.
##
## Call:
## lm(formula = quality ~ ., data = wines_for_prediction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
We got a model with an R-squared of 0.2819. Even though we used all the variables, R-squared is pretty low. Here is a more reasonable linear regression model (in my opinion).
##
## Call:
## lm(formula = quality ~ volatile.acidity + residual.sugar + alcohol +
## density, data = wines_for_prediction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3401 -0.5052 -0.0317 0.4723 3.1304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.224822 11.977000 6.197 6.22e-10 ***
## volatile.acidity -2.059334 0.108919 -18.907 < 2e-16 ***
## residual.sugar 0.052299 0.004914 10.642 < 2e-16 ***
## alcohol 0.286371 0.017747 16.136 < 2e-16 ***
## density -71.546483 11.922692 -6.001 2.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7601 on 4893 degrees of freedom
## Multiple R-squared: 0.2639, Adjusted R-squared: 0.2633
## F-statistic: 438.6 on 4 and 4893 DF, p-value: < 2.2e-16
My final model is not perfect either. There is high multicollinearity among the variables, and on top of that R-squared is very low. We can’t use these models to predict wine quality.
We found that a lot of variables are highly correlated with density, and we were able to build a good model for predicting the density of wine. Quality is not so predictable: few variables are correlated with it, and our model wasn’t very good.
The most interesting conclusion is that wine quality is difficult to predict, while some other variables are easy to predict.
We built two models. The first model predicts the density of wine. It has an R-squared of 0.93, so I can say that this model is strong. But if we want to use it for predictions, we should explore the relationships among the exogenous variables further and remove variables that lead to multicollinearity.
The second model predicts wine quality. Its R-squared is low (0.26), so we can’t use it for prediction purposes.
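To make the R-squared figures concrete: R-squared is the share of the variance of the response that the model explains, 1 - SS_res / SS_tot. The sketch below checks this against `summary()` on synthetic data (the variables `x` and `y` are made up for illustration).

```r
# Synthetic data with a weak linear signal and a lot of noise.
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)

# R-squared by hand: 1 - residual sum of squares / total sum of squares.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# The same quantity as reported by summary().
r2_lm <- summary(fit)$r.squared

c(manual = r2_manual, lm = r2_lm)
```

Read this way, an R-squared of 0.26 means that about three quarters of the variation in quality scores is left unexplained by the chosen chemical properties.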
In this section I want to show the 3 most important plots from this project.
This plot is important for our analysis. It shows how many wines of each quality level we have in our data. There are no wines with a quality of 1, 2 or 10. The quality distribution is bell-shaped, which means most of our data is represented by average-quality wine.
This plot shows alcohol, the variable most correlated with quality in the dataset. Wines with a quality level of 5 tend to contain the least alcohol, and alcohol content rises with each step away from level 5 in either direction. It’s easy to see that quality and alcohol are related, but the relationship is not linear.
On this plot we can see how both alcohol and sugar influence the density of wine. The higher the alcohol content, the lower the density. Sugar has the opposite relation: the higher the sugar content, the higher the density. We can also observe that wines with high alcohol content tend to contain less residual sugar.
In this project we analysed the white wine dataset and tried to build a model for predicting wine quality. We found many patterns in the wines’ chemical properties and even built a regression model for predicting the density of wine. Density is easily predicted from the other chemical properties, and a simple linear regression model was enough. On the other hand, we weren’t able to build a model for predicting wine quality. In my opinion quality is an emotional phenomenon that depends on taste, mood and habit rather than on chemical properties.
So after this sketchy analysis of the wine dataset we can come up with two main conclusions:
The secondary conclusions are:
Our analysis wasn’t complete. We could enrich it in the future by using other prediction models, such as random forests and regression trees.